The idea of the recipes package is to define a recipe or blueprint that can be used to sequentially define the encodings and preprocessing of the data (i.e. “feature engineering”) before we build our models.
Import the data and split it into training and testing sets using initial_split():
library(tidyverse)
library(tidymodels)
ames <- read_csv("https://raw.githubusercontent.com/kirenz/datasets/master/ames.csv")
ames <- ames %>%
  select(-matches("Qu"))
set.seed(100)
new_split <- initial_split(ames)
new_train <- training(new_split)
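new_test <- testing(new_split)
Next, we use a recipe() to build a set of steps for data preprocessing and feature engineering.
First, we tell the recipe() what our model is going to be (using a formula here) and what our training data is.
step_novel() converts all nominal predictors to factors and reserves an extra level ("new") for values that were not seen in the training data.
step_dummy() converts the nominal variables into numeric dummy (indicator) variables.
step_zv() removes any variable that contains only a single value (zero variance).
step_normalize() centers and scales the predictors.
Finally, we prep() the recipe(). This means we actually do something with the steps and our training data.
ames_rec <-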
  recipe(Sale_Price ~ ., data = new_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  prep() # put recipe into action
# Show the content of our recipe
ames_rec
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 73
##
## Training data contained 2198 data points and no missing data.
##
## Operations:
##
## Novel factor level assignment for MS_SubClass, MS_Zoning, Street, ... [trained]
## Dummy variables from MS_SubClass, MS_Zoning, Street, Alley, ... [trained]
## Zero variance filter removed MS_SubClass_new, ... [trained]
## Centering and scaling for Lot_Frontage, Lot_Area, ... [trained]
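The MS_SubClass_new column mentioned in the zero variance line most likely comes from the extra level that step_novel() adds: no training row has that level, so its dummy column is constant and step_zv() drops it. The following is a minimal sketch of the step_novel() behaviour on a small example. The data frames toy_train and toy_new and the column colour are hypothetical and not part of the Ames data.

```r
library(recipes)

# Hypothetical toy data: `colour` has two values in training,
# while the new data contains a third value that was never seen.
toy_train <- data.frame(
  y      = c(1, 2, 3, 4),
  colour = c("red", "blue", "red", "blue"),
  stringsAsFactors = FALSE
)
toy_new <- data.frame(
  y      = 5,
  colour = "green",
  stringsAsFactors = FALSE
)

toy_rec <- recipe(y ~ ., data = toy_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  prep()

# The trained variable carries an additional level called "new"
train_levels <- levels(bake(toy_rec, new_data = NULL)$colour)
train_levels

# The unseen value "green" is mapped to that level
new_value <- as.character(bake(toy_rec, new_data = toy_new)$colour)
new_value
```

Without this step, a category that appears only in the test set would have no matching dummy column when the recipe is applied to new data.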
Print a summary of our recipe:
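As a small self-contained sketch of what such a summary looks like, the example below builds a recipe on hypothetical toy data (toy, price, area and type are made-up names) and calls summary() on the prepared recipe. The result is a tibble with one row per variable.

```r
library(recipes)

# Hypothetical toy data standing in for the Ames training set
toy <- data.frame(
  price = c(100, 150, 120, 180),
  area  = c(50, 80, 60, 95),
  type  = factor(c("a", "b", "a", "b"))
)

toy_rec <- recipe(price ~ ., data = toy) %>%
  step_dummy(all_nominal()) %>%
  step_normalize(all_predictors()) %>%
  prep()

# For a prepared recipe, summary() lists the variables that exist
# after preprocessing, with the columns variable, type, role and source
rec_summary <- summary(toy_rec)
rec_summary
```

For the recipe built above, the equivalent call would be summary(ames_rec).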
To obtain the data frame from the recipe, we use the function juice():
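The sketch below illustrates the call on hypothetical toy data (toy_train and toy_test are made-up names). It also shows bake(), which newer versions of recipes recommend in place of juice(): bake(new_data = NULL) returns the preprocessed training data, and bake() with a data frame applies the trained steps to new data.

```r
library(recipes)

# Hypothetical toy data
toy_train <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8))
toy_test  <- data.frame(x = c(5, 6), y = c(10, 12))

toy_rec <- recipe(y ~ ., data = toy_train) %>%
  step_normalize(all_predictors()) %>%
  prep()

# Preprocessed training data as a tibble
train_juiced <- juice(toy_rec)

# Same result using bake() on the data stored in the recipe
train_baked <- bake(toy_rec, new_data = NULL)

# Apply the trained steps to new data; the mean and standard deviation
# used for scaling are those estimated from the training data
test_baked <- bake(toy_rec, new_data = toy_test)
test_baked
```

For the recipe built above, the corresponding calls would be juice(ames_rec) for the training data and bake(ames_rec, new_data = new_test) for the test data.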